Recent progress (2020/03/11 ~ 2020/03/17)

In order to select significant motifs, we calculate the weighted L2 for each motif over all the conditions so that we could consider no only the regression coefficient but also the transcript frequency at the same time:[plot] \[ \text{c: condition}\\ \text{g: gene}\\[2ex] e_{mg}= \begin{cases} 1,\quad \text{at least one motif } m \text{ is on gene }g\\ 0, \quad otherwise \end{cases}\\ \] \[ \begin{align} \text{Variance predicted by motif} &= \sum_c \sum_g \beta _{mc}^2 \ \times \ e_{mg}\\ &= \sum_c \beta^2_{mc}\ \times\ transcript\ frequence\\ &= L_2^2\ \ of \ motif\ \times\ transcript\ frequence \end{align} \] By ordering and plotting the motif with the weighted L2, 6 significant motifs are selected.

Then, we selected 10 groups of models which include “heat shock”, “hypothermia”, “H2O2”, “sorbitol” etc where only the models represented environmental stress within 60mins are considered. mean L2 of each significant motif are computed from their regression coefficients to indicate how much regulatory effect they have under specific environmental stress:[plot] \[ mean\ L_2 = \sqrt{{\sum_c^N \beta _{mc}^2}\over \text{N}} \]

However, the overall regulatory effect in a cell also depends on how strong the environmental stress is. For example, the change in expression level under heat shock from 25 degree to 37 degree should be much larger than that under heat shock from 29 degree to 33 degree. Therefore, we calculate the proportion of variance explain by motifs over the total variance to reduce such difference:[plot] \[ \begin{align} \text{Variance pridict by motif} &= \sum_c({1\over G}\times \sum_g \beta _{mc}^2 \times e_{mg} - ({1\over G}\times\sum_g \beta _{mc} \times e_{mg})^2)\\ &= \sum_c (\beta^2_{mc}\ \times\ percentage\ of\ transcript\ frequence- (\beta_{mc} \times percentage\ of\ transcript\ frequence)^2) \\ \end{align} \]

\[ \text{proportion of variance explain by motif } = {\text{variance explain by motifs}\over \text{total variance}} \] We plot bar graphs to compare the value across each group of environmental stresses. The shape of the proportion variance graph is similar to that of the mean of L2. There is still a large difference between the value of heat shock from 29C to 33C and heat shock from 25C to 37C.

Previous progress (2020/03/04 ~ 2020/03/10)

We apply downstream sequence with fix-length (120 nucleotides) from Samuel as 3’UTR to reduce the loss of the expression level data of 548 genes which is not present in both Cheng’s and fungiDB datasets. 503 downstream sequences are applied (3’UTR of 45 genes are not present in all three of the dataset).

To virtualize the data and compare the coefficient across different type of models, we plotted box graph which has coefficient as y-axis and motif as x-axis where each data point come from one coefficient of a model and colored by its group (e.g. heat shock group, hypothermia group).

As another approach from comparing the motifs’ regulatory effect between different types of environmental stresses, we compare the absolute value of their regression coefficients. Motifs are ranked in each model where rank 1 is given to the motifs with the highest absolute value of coefficient. Then, for each group of environmental stress, we compute the average rank of each motif weighted by L2 of the coefficients in each model. Therefore, the rank from a model with motifs with low regression coefficients will get less weight rather than that of a model with motifs with high regression coefficients.

At last, we compare the model’s bias with the mean of Log2 expression level. The bias of the model can be considered as the expected gene expression level change of a gene with no cis-regulatory element we considered in the model. In other words, the bias indicates the expression level change that cannot be explained by the model and could indicate the estimated proportion of the efforts on regulation these cis-regulatory elements make.

previous progress (2020/02/26 ~ 2020/03/03)

followed a suggestion from the group meeting, I replace the baseline of imputation for missing value from “replace missing value by zero” to “replace missing value by mean of column”. As a result, there is not an obvious difference between the performance of these two baseline approaches.

we downloaded downstream sequence with fix-length (120 nucleotides) as 3’UTR to reduce loss of the expression level data of 1868 genes causing by inner joining the expression level of 6152 genes with the 4388 motif frequency data. 1320 downstream sequences are downloaded from fungiDB. These downloaded sequences are modified and saved as file 3UTR_fungiDB under the data folder. The length of 120 nucleotides is a sensible number just below the median length (suggested by Edward Wallace). As a comparison, we also calculated the mean and median of the 3’UTR length from Cheng et al.’s data, which is 144 and 119 respectively. Finally, we plot a bar graph to compare the frequency of the motif between Cheng’s and fungiDB’s dataset. Since the frequency of most motifs is quite similar in both datasets, we believe the data from fungiDB is reliable enough to be applied.

Furthermore, we compare the amount of transcript include each motif and the coefficient of it. The log10 transcript frequency is plotted. The regression coefficient of motifs with a low frequency could be unreliable even it is given a significant value.

project discription

Gene expression is the process through which cells generate diversity and resilience to environmental changes. It consists of two steps: transcription, where an intermediate molecule (mRNA) is produced from the DNA, and translation of mRNA into protein. The rate of transcription and translation can be regulated by regulatory elements acting in both cis and trans. 3’UTR, 5’UTR, and promoter are known as powerful regulatory processes that determine the rate of mRNA synthesis and decay.

In this project, we are interested in using these cis-regulatory features to predict multidimensional output that indicates how the mRNA abundance change in multiple environmental conditions. Since our goal is to quantify the effect of different elements, we should avoid clustering and go straight from element to expression pattern. Linear models with multidimensional lasso (known as an algorithm of multi-task learning) which is already implemented in the glmnet package should be a good start. As an extension, we may apply to multiple states of mRNA processing like splicing, decay, and translation by combining multiple data sets. Finally, we can carry out a quantitative analysis of the contribution of motifs in different types of CRE and ask which regions make more of a contribution to dynamic changes in gene expression.

Index

Part 1: setup and investigate the dataset

After loading all the required library and function, we investigate and evaluate the dataset we have.

Load library and function

library(data.table)
## Warning: package 'data.table' was built under R version 3.6.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
library(lmodel2)
library(splitstackshape)
library(gridExtra)
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
library(impute)

Construct the design matrix

Expression level change of 6152 distinct genes under 173 different environmental stresses from Gasch’s study

In this project, we apply the data describes gene expression level in various environment condition from Gasch’s study Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. To figure out how the expression level change, DNA microarrays were used to measure the relatively abundance of mRNA before and after a series of environmental stresses.

The 69 candidate sequence motifs we use in this project is from 3 different studies:
  1. 53 of them come from the study of Shalgi et al (2005). The candidate motifs are derived by analyzing the exhaustively enumerating all k-mers(k = {8,9,10,11,12}) and looking for over-represented motifs with extreme half-life value. For the k-mers method, 515 significant k-mers are selected with ranksum test and clustered by ClustalW which resulted in 51 clusters of motifs. on the other hand, 2 more motifs are found by grouping genes with extreme half-life and ran Gibbs sampler.

  2. 14 of them come from the study of Hogan et al. (2008) who have applied two related computational methods to identify candidates for the binding sites of 40 out of the more than 500 known and predicted RBPs in S. cerevisiae: (1)“finding informative regulatory elements” (FIRE) and (2)“relative filtering by nucleotide enrichment” (REFINE). The previous one searches for motifs with informative patterns of enrichment and the other one identifies all hexamers that are significantly enriched in untranslated regions, filters out regions of target sequences that are relatively devoid of such hexamers, and then applies the “multiple expectation maximization for motif elicitation” (MEME) motif-finding algorithm. As a result, 14 motifs that are likely to be the binding sites of 16 RBP are found.

  3. The rest 4 are from the study of Cheng (2017), four mRNA-stability related motifs in the 3′ UTR are found by De novo motif searching and their reliability and effect have been examined by linear mixed effect model, Fisher test P-value corrected with Benjamini–Hochberg, Wilcoxon rank-sum test and multivariate linear regression.

UTR sequence of 4388 genes from Cheng et al. 

The UTR sequence of 4388 genes are from Cheng et al. (2017) as well.

1868 gene expression level data loss after inner join above data by gene name.

After inner joining the expression level of 6152 genes with the 4388 motif frequency data which computed from 3’UTR data of Cheng’s study, expression level data of 1868 genes loss.

names_gene_information_loss = setdiff(expressionLevel_Gasch$geneName,UTR_cheng$genename)
length(names_gene_information_loss)
## [1] 1868

apply fix-length downstream sequence from fungiDB as 3’UTR for 1320 of the 1868 genes.

first, we check the mean length of 3’UTR from Cheng et al.

## [1] "median length of 3'UTR from Cheng et al. =  119"
## [1] "mean length of 3'UTR from Cheng et al. =  144.873290793072"

Since there is no 3’UTR annotation in most of the reliable online database, we download downstream sequence with fix-length (120 nucleotides) as 3’UTR. 1320 downstream sequence are downloaded. The length of 120 nucleotides is suggested by my supervisor Edward Wallance as a sensible number just below the median length.

UTR_fungiDB = read_csv("../data/3UTR_fungiDB")
## Parsed with column specification:
## cols(
##   genename = col_character(),
##   UTR3_seq = col_character()
## )
UTR_Cheng_fungiDB = rbind(UTR_cheng,UTR_fungiDB,fill=TRUE)

names_gene_cheng = UTR_cheng$genename
names_gene_fungiDB = UTR_fungiDB$genename

Construct the design matrix with additional data from fungiDB

# get UTR sequence from both dataset
UTR_3 <- UTR_Cheng_fungiDB$UTR3_seq

# Search and add frequency of each c(motif) as a column in ref dataset
ref_motifs <- tibble(geneName = UTR_Cheng_fungiDB$genename)
for (i in 1:length(motifs)){
  ref_motifs <- mutate(.data = ref_motifs,!!motifs_raw[i] := str_count(UTR_3, motifs[i]))
}


# merge motifs
fullTable_Gasch <- merge(expressionLevel_Gasch,ref_motifs,by = "geneName")

# convert type of motifs frequency to factor for violin plotting
fullTable_Gasch_factor = fullTable_Gasch
for(i in names_motifs_all){
    fullTable_Gasch_factor[[i]] <- as.factor(as.character(fullTable_Gasch_factor[[i]]))
}

check the motif frequency from both dataset “cheng” and “fungiDB”

fullTable_Gasch_original = fullTable_Gasch[fullTable_Gasch$geneName %in% names_gene_cheng,]
fullTable_Gasch_fungiDB = fullTable_Gasch[fullTable_Gasch$geneName %in% names_gene_fungiDB,]


summary_motif_frequency = data.frame(matrix(ncol = 3, nrow = 0))

for(m in names_motifs_all){
  summary_motif_frequency = rbind(summary_motif_frequency, data.frame(
    motif = m,
    dataset = "Cheng",
    frequency = sum(fullTable_Gasch_original[[m]]),
    frequency_per_gene = sum(fullTable_Gasch_original[[m]])/(dim(fullTable_Gasch_original)[1])
  ))
  summary_motif_frequency = rbind(summary_motif_frequency, data.frame(
    motif = m,
    dataset = "fungiDB",
    frequency = sum(fullTable_Gasch_fungiDB[[m]]),
    frequency_per_gene = sum(fullTable_Gasch_fungiDB[[m]])/(dim(fullTable_Gasch_fungiDB)[1])
  ))
}

Construct the design matrix with additional data from Samuel

UTR_Samuel <- read.csv(file="../data/genome_wide_120nt_3UTR_sequence.csv")
colnames(UTR_Samuel) = c("genename","UTR3_seq")

UTR_Cheng_fungiDB_Samuel = rbind(UTR_Cheng_fungiDB,UTR_Samuel,fill=TRUE)
# get UTR sequence from both dataset
UTR_3 <- UTR_Cheng_fungiDB_Samuel$UTR3_seq

# Search and add frequency of each c(motif) as a column in ref dataset
ref_motifs <- tibble(geneName = UTR_Cheng_fungiDB_Samuel$genename)
for (i in 1:length(motifs)){
  ref_motifs <- mutate(.data = ref_motifs,!!motifs_raw[i] := str_count(UTR_3, motifs[i]))
}


# merge motifs
fullTable_Gasch <- merge(expressionLevel_Gasch,ref_motifs,by = "geneName")

# convert type of motifs frequency to factor for violin plotting
fullTable_Gasch_factor = fullTable_Gasch
for(i in names_motifs_all){
    fullTable_Gasch_factor[[i]] <- as.factor(as.character(fullTable_Gasch_factor[[i]]))
}

8 motifs are removed from list considering their low frequency on 3’UTR

We matching the 69 motifs on every gene sequence to compute the frequency matrix which could be used as the design matrix of our models. The distribution of the frequency of each motif across all the genes is very unbalanced. 8 motifs have never been investigated from any of the 3’UTR from the study of Cheng et al. Therefore, we temporarily remove them.

Frequency number of motifs (n)
n = 0 8
0 < n < 6 9
5 < n < 21 26
n > 20 26

One possible explanation of the huge amount of low-frequency motifs is that it is hard to match exactly the regular expression to the UTR sequence. Another point is, the secondary-structure motifs are more likely to be found on 5’UTR rather than 3’UTR. Also, a motif with a significant effect on regulation is reasonable to have a small frequency.

Part 2: Assess the rationality of group lasso

Before applying multi-task learning, we would like to investigate the data and assess if it is reasonable to apply group lasso on them.

4 significant motifs on heat-shock-relevant environmental stress.

To investigate how the expression level changes linearly depends on the motifs, we calculate the Pearson correlation coefficient (PCC) between them. Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y: \[ p_{x,y}= {cov(X,Y)\over \sigma_X \sigma_Y} \] We start by picking a group of environmental conditions about Heat Shock from Various Temperatures to 37°C. Four motifs show a much higher correlation coefficient rather than the others, which, indicate the linear relationship between these motifs and expression level. One point worth mentioning is that all these motifs are also focused by Abhi where TGTATAWT is explained to be positively associated with RNA stability and the rest three of them are explained to be negatively associated with the RNA stability. However, Abhi is not quite confident with the conclusion of UGUAHMNUA as the negative relationship can only be derived from Chan’s data but not Sun’s data.

names_temperature_condition = vector(mode="character", length=5)
names_temperature_condition[1] = "hs_17to37_20min"
names_temperature_condition[2] = "hs_21to37_20min"
names_temperature_condition[3] = "hs_25to37_20min"
names_temperature_condition[4] = "hs_29to37_20min"
names_temperature_condition[5] = "hs_33to37_20min"

9 heat-shock-relevant motifs are selected from linear models

Before fitting the linear model, we temporarily remove the motifs with a total frequency lower than 5 since we are not confident enough to tell whether a low-frequency motif depends linearly on the heat-shock related environment condition. Then, we fitted the linear model with the lm() function with default least squares method: \[ \min_w \sum (w*e)^2 \]

## [1] "mean square error:"
## [1] 1.973801

9 heat-shock-relevant motifs are selected from the linear model with the following coefficient:

Motif Estimate Std.Error t value Pr(>|t|)
UGUAHMNUA -1.054 0.085 -12.369 <2e-16
ATATTC 0.309 0.057 5.432 5.90e-08
TGTATAWT -0.468 0.095 -4.907 9.62e-07
WUUGUAWUWU -0.534 0.157 -3.413 0.0006
UAAUAAUW -0.250 0.083 -3.025 0.0025
TGTAAATA 0.244 0.106 2.308 0.0211
AAAATAAAG -0.568 0.281 -2.018 0.0436
UUUAAUGA 0.302 0.162 1.864 0.0624
TCATGTAT -0.327 0.207 -1.580 0.1141

This table is sorted by Pr(>|t|) (also known as the p-value) which is the probability of achieving a |t| as large as or larger than the observed absolute t value if the null hypothesis (estimate = 0) was true. In short, it is the probability that there is no relationship between that feature and the predicted value (Ronald L. Wasserstein et al., 2016). Estimate shows the coefficient or weight gains by the responding features in this linear model. Std.Error is the standard error of the estimate value, which, can be used to calculate the Confidence interval. For example, the 95% confidence interval could be obtained from Estimate +- 1.96*Std.Error. t value is calculated from the estimates divided by their standard errors. To make the best use of this value, we need to look up the table of t distribution to learn the reject boundary. In this case, we could simply say the larger the magnitude of the t-value is, the less likely that the coefficient is 0.

# investigate coefficients
summary(model_hs_17to37)
## 
## Call:
## lm(formula = formula_model_hs_17to37, data = fullTable_Gasch)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6815 -0.6631  0.0185  0.6891  7.2160 
## 
## Coefficients: (4 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.041493   0.019599   2.117 0.034271 *  
## ATATTC           0.286250   0.037205   7.694 1.54e-14 ***
## TGCAT           -0.003710   0.028562  -0.130 0.896640    
## TGTAAATA         0.247406   0.069172   3.577 0.000349 ***
## TTTTTTA          0.048772   0.041515   1.175 0.240100    
## CHUGUAWAUAWAWA  -0.272768   0.218284  -1.250 0.211470    
## UGUAHMNUA       -1.041902   0.055178 -18.882  < 2e-16 ***
## WUUGUAWUWU      -0.376281   0.101379  -3.712 0.000207 ***
## UAAUAAUW        -0.073003   0.045165  -1.616 0.106042    
## WUAUAUAW         0.099146   0.033299   2.977 0.002913 ** 
## HWNCAUUWY       -0.014674   0.027010  -0.543 0.586942    
## DRARAMGMD       -0.020475   0.029980  -0.683 0.494650    
## AAUACCY          0.027576   0.094818   0.291 0.771184    
## UKWCGRGGN        0.366512   0.210492   1.741 0.081673 .  
## UUUAAUGA         0.382710   0.107962   3.545 0.000394 ***
## UUCUUGUW         0.097166   0.104560   0.929 0.352758    
## TATATATA        -0.006629   0.041879  -0.158 0.874240    
## TATTGTGTAGTA     4.476480   2.378234   1.882 0.059825 .  
## ATGTAGTAG       -0.766822   0.376848  -2.035 0.041891 *  
## GTAGTATTT       -0.941654   0.377136  -2.497 0.012544 *  
## GTAAAMAT        -0.120343   0.110465  -1.089 0.275992    
## TTTTGATGTAGTW   -0.171163   1.036841  -0.165 0.868883    
## GTATACCTA        0.691261   0.368356   1.877 0.060596 .  
## TATTATTAT        0.139838   0.062062   2.253 0.024265 *  
## TGTTTTGATTT     -0.112483   0.997501  -0.113 0.910219    
## TCAGTGATTATG     0.750659   0.849358   0.884 0.376825    
## AGGAAATTG        0.462287   0.469861   0.984 0.325195    
## ATATTGTAAGAAA          NA         NA      NA       NA    
## CGGGTAAG        -0.366438   0.498956  -0.734 0.462714    
## ATAAAAAG        -0.036517   0.095521  -0.382 0.702253    
## AAAAAGACA       -0.182533   0.236651  -0.771 0.440536    
## CAAGTGGGW        0.459818   0.576968   0.797 0.425493    
## CTACTTCA         0.292598   0.223114   1.311 0.189738    
## YTCTCTCTC       -0.277809   0.229238  -1.212 0.225583    
## TGTATAWT        -0.394746   0.063441  -6.222 5.07e-10 ***
## TATGTATTGT      -1.033652   0.498340  -2.074 0.038084 *  
## GTGGTATGT       -0.013965   0.642791  -0.022 0.982667    
## CGCTATTG         1.023888   0.376720   2.718 0.006580 ** 
## TCATGTAT        -0.268727   0.128503  -2.091 0.036532 *  
## CAGCAACT         0.784740   0.391030   2.007 0.044788 *  
## ATACGGTGT        0.222377   0.657901   0.338 0.735362    
## ACGTTGCT         0.162931   0.364094   0.447 0.654525    
## TCTTTCCG         0.545450   0.288286   1.892 0.058510 .  
## TTTTCTAGGDD     -0.374415   0.470090  -0.796 0.425773    
## GTTTGTTG        -0.659629   0.277198  -2.380 0.017346 *  
## RRAGACAAGT      -0.486849   0.413210  -1.178 0.238736    
## WWTMGTATATTGTMA -2.371029   0.997519  -2.377 0.017474 *  
## WRWYACGAAAGTMA         NA         NA      NA       NA    
## TWGTATAGTA       0.030453   0.317079   0.096 0.923488    
## AAAATAAAG       -0.401311   0.176039  -2.280 0.022646 *  
## GTCAGTGMTTA            NA         NA      NA       NA    
## RWWGTTACGAAA    -0.259164   0.579768  -0.447 0.654874    
## WDYDTTTTGATGTA         NA         NA      NA       NA    
## RWATATGCGTTTT    0.150464   0.835925   0.180 0.857158    
## HUUUUUUHW        0.027382   0.030327   0.903 0.366596    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.408 on 11529 degrees of freedom
##   (125 observations deleted due to missingness)
## Multiple R-squared:  0.06315,    Adjusted R-squared:  0.05909 
## F-statistic: 15.54 on 50 and 11529 DF,  p-value: < 2.2e-16

Identify potential relationships between tasks

To penalty the irrelevant features and ensure the comparison across the models, we plan to apply group lasso. However, before applying the multi-task learning method, we want to ensure there are potential relationships between tasks.

motif_temperature_sensitive = c('UGUAHMNUA','TGTATAWT','ATATTC','TGTAAATA')
names_temperature_condition = vector(mode="character", length=5)
names_temperature_condition[1] = "hs_15min_hs-1"
names_temperature_condition[2] = "hs_30min_hs-1"
names_temperature_condition[3] = "hs_40min_hs-1"
names_temperature_condition[4] = "hs_60min_hs-1"
names_temperature_condition[5] = "hs_80min_hs-1"

# pick a group of temperature condition
names_temperature_condition = vector(mode="character", length=3)
names_temperature_condition[1] = "29C(1M_sorbitol)~33C(1M_sorbitol)_05min"
names_temperature_condition[2] = "29C(1M_sorbitol)~33C(1M_sorbitol)_15min"
names_temperature_condition[3] = "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"

names_temperature_condition = vector(mode="character", length=5)
names_temperature_condition[1] = "hs_37to25_15min"
names_temperature_condition[2] = "hs_37to25_30min"
names_temperature_condition[3] = "hs_37to25_45min"
names_temperature_condition[4] = "hs_37to25_60min"
names_temperature_condition[5] = "hs_37to25_90min"

Part 3: apply group lasso

introduction

To investigate and quantify how the cis-regulatory element affects the gene regulation, we built a series of linear models where each model takes the frequencies of cis-regulatory elements as input and predicts how the gene expression level under certain environmental stress. Their coefficient can be used to estimate the regulatory effect of both a single cis-regulatory element or all the cis-regulatory as a group. For example, if an element is given a relatively large coefficient in a model which predicts the change of expression level under certain environmental stress, we can say the element play a relatively important role in the regulation under that environmental stress. On the other hand, the bias of the model can be considered as the expected gene expression level change of a gene with no cis-regulatory element we considered in the model. In other words, the bias indicates the expression level change that cannot be explained by the model and can be used to estimate the proportion of the efforts on regulation these cis-regulatory elements make.

At this stage, our aim is to apply group lasso as a multi-task learning method to training all the models together which allows the comparison across each model and ensures their generalization. The L2 penalty term can filter out most of the irrelevant factors by penalizing their coefficient towards zero.

The penalty on the coefficient vector for variable j is: \[ (1-\alpha)/2||\beta_j||_2^2+\lambda||\beta_j||_2. \]

distribution of missing value

Above all the expression level data, around 3% of them are missing value, Where most of the genes (94.8%) have a relatively complete expression levels’ data (90%) while 41 genes (0.7%) have missing valves under more than quarter of environmental stresses.

number of genes missing value (n)
755 (12.3%) 0
4372 (71.1%) 0 < n < 9 (5%)
701 (11.4%) 8 < n < 18 (10%)
283 (4.6%) 17 < n < 44 (25%)
41 (0.7%) 43 < n < 99 (58%)

The 41 genes with missing values larger than 43 and smaller than 99

names_genes = expressionLevel_Gasch$geneName
names_genes[na_count_sum > 43 & na_count_sum <99]
##  [1] "YAL058C-A" "YAR066W"   "YBR003W"   "YBR151W"   "YBR153W"   "YBR155W"  
##  [7] "YBR157C"   "YBR159W"   "YDL036C"   "YDL130W"   "YDL206W"   "YDL207W"  
## [13] "YDL208W"   "YDL209C"   "YDL210W"   "YDL211C"   "YDL212W"   "YDL213C"  
## [19] "YDL214C"   "YDL216C"   "YDL217C"   "YDR430C"   "YDR434W"   "YEL076W-C"
## [25] "YER066C-A" "YFL013W-A" "YHR149C"   "YIL015C-A" "YIL066C"   "YIL067C"  
## [31] "YIL069C"   "YIL070C"   "YIL072W"   "YIL075C"   "YIL076W"   "YJL012C"  
## [37] "YJL150W"   "YJL194W"   "YML084W"   "YMR117C"   "YOR304C-A"

On the other hand, the distribution of missing value by 173 environmental stress indicates that there are 14 environmental stress that have a relatively higher missing value ratio (larger than 10% but smaller than 23%)

number of environmental stresses missing value (n)
0 0
55 (31.8%) 0 < n < 62 (1%)
97 (56.1%) 61 < n < 308 (5%)
7 (4.0%) 307 < n < 616 (10%)
14 (8.1%) 615 < n < 1385 (23%)

the 14 environmental stresses with missing value ratio larger than 10% are listed here

na_count_sum[na_count_sum>315]
##                           hs_05min_hs-1                           hs_10min_hs-1 
##                                     893                                     704 
##                           hs_15min_hs-1                           hs_20min_hs-1 
##                                     500                                     780 
##                           hs_30min_hs-1                           hs_40min_hs-1 
##                                     783                                    1305 
##                           hs_60min_hs-1                           hs_80min_hs-1 
##                                     726                                    1384 
##                0.32mM_H2O2_40min_rescan                  2.5mM_DTT_005min_dtt-1 
##                                     509                                    1002 
##                  2.5mM_DTT_015min_dtt-1                  2.5mM_DTT_030min_dtt-1 
##                                    1162                                    1313 
##                  2.5mM_DTT_045min_dtt-1                  2.5mM_DTT_060min_dtt-1 
##                                     597                                     938 
##                  2.5mM_DTT_090min_dtt-1                  2.5mM_DTT_180min_dtt-1 
##                                     591                                     578 
##           YPD_stationary_phase_1d_ypd-1          YPD_stationary_phase_28d_ypd-1 
##                                     723                                    1043 
## steady state 36 dec C ct-2 (repeat hyb) 
##                                     697

filter out gene with high value-missing ratio

The gene with high missing value could be caused by low abundance of mRNA in the cell which means there could be large errors for the measurements on these gene. It is better to remove them first.

na_count_sum = rowSums(is.na(expressionLevel_Gasch[4:176]))
expressionLevel_Gasch_filtered = expressionLevel_Gasch[na_count_sum<18,]

fill missing data by imputation

Applying imputation to estimate the missing values should be a much better way rather than simply ignore the missing values or replacing them by mean or 0 since it reduce the bias. Olga Troyanskaya et al.(2001) suggested that weighted K-nearest neighbors (KNNimpute) could be another sensitive method to deal with the missing value.

According to their article, when we consider gene A that has one missing value under environmental stress X, we could look for how the expression level of gene A change in other environmental stress Y and looking for k other genes B that show similar behavior as gene A. Then we fill the missing value by the weighted average of genes B. The contribution of each gene is weighted by similarity of its expression to that of gene A.

we can estimate the performance of imputation by randomly knocking out 0.1% of non-NA and check the mean square error between estimation and actual value.

k_value_validation_df = data.frame(matrix(ncol = 2, nrow = 0))

k_list = c(2,5,8,11,14,17,20,23,27,30)
size_test = 1000

for(k in k_list){
  for(randomSeed in 362136666:362136675){
    set.seed(randomSeed)
    
    
    row_numbers = c()
    column_numbers = c()
    
    matrix = as.matrix(expressionLevel_Gasch_filtered[4:length(expressionLevel_Gasch_filtered)])
    matrix_complete = matrix
    
    set.seed(randomSeed)
    row_random = sample(1:dim(matrix)[1],size_test*2,replace = TRUE)
    column_random = sample(1:dim(matrix)[2],size_test*2,replace = TRUE)
    
    
    i = 1
    j = 1
    while (i <= round(size_test*2) & j <= size_test) {
      if(!is.na(matrix[row_random[i],column_random[i]])){
        row_numbers = c(row_numbers,row_random[i])
        column_numbers = c(column_numbers,column_random[i])
        matrix[row_random[i],column_random[i]] = NA
        j = j + 1
      }
      i = i + 1
    }
    
    
    expressionLevel_impute_knn = (impute.knn(matrix ,k = k, rowmax = 0.5, colmax = 0.8, maxp = 7000, rng.seed=362436069))$data
    
    i = 1
    se = 0
    while (i < j) {
      se = se + (as.double(matrix_complete[row_numbers[i],column_numbers[i]]) - as.double(expressionLevel_impute_knn[row_numbers[i],column_numbers[i]]))^2
      i = i + 1
    }
    
    k_value_validation_df = rbind(k_value_validation_df, c(k,se/j))
    print(paste("k = ",k," mse = ",se/j))
  }
}


colnames(k_value_validation_df) = c("k","mse")
k_value_validation_df$k = as.integer(k_value_validation_df$k)

k_value_validation_ds <- do.call(rbind, lapply(split(k_value_validation_df, k_value_validation_df$k), function(d) {
  data.frame(mean = mean(d$mse), sd = sd(d$mse), k = d$k)
}))

an optimal value for k should sit between 5 and 20. In this case, we pick k = 11 for our imputation as a conserved and high-performance hyperparameter.

baseline performance: simply replace missing value with mean of column

As one of the easiest approach to deal with the missing values, we could simply replace them by mean of that column To estimate of the performance, we randomly knock out 0.1% of non-NA and check the mean square error.

baseline_df = data.frame(matrix(ncol = 1, nrow = 0))
expressionLevel_matrix = as.matrix(expressionLevel_Gasch_filtered[4:length(expressionLevel_Gasch_filtered)])
storage.mode(expressionLevel_matrix) <- "double"
expressionLevel_column_mean = colMeans(expressionLevel_matrix,na.rm = TRUE)

for(randomSeed in 362136666:362136675){
  size_test = 1000
  row_numbers = c()
  column_numbers = c()
  
  
  set.seed(randomSeed)
  row_random = sample(1:dim(expressionLevel_matrix)[1],size_test*2,replace = TRUE)
  column_random = sample(1:dim(expressionLevel_matrix)[2],size_test*2,replace = TRUE)
  
  i = 1
  j = 1
  while (i <= round(size_test*2) & j <= size_test) {
    if(!is.na(expressionLevel_matrix[row_random[i],column_random[i]])){
      row_numbers = c(row_numbers,row_random[i])
      column_numbers = c(column_numbers,column_random[i])
      j = j + 1
    }
    i = i + 1
  }
  
  
  se = 0
  for (i in 1:(j-1)) {
    se = se + (expressionLevel_matrix[row_numbers[i],column_numbers[i]] - expressionLevel_column_mean[column_numbers[i]])^2
  }
  
  baseline_df = rbind(baseline_df, c(se/j))
  print(paste("estimate mse = ",se/j))
  
}
## [1] "estimate mse =  0.660283266661971"
## [1] "estimate mse =  0.583398660384386"
## [1] "estimate mse =  0.6700609834903"
## [1] "estimate mse =  0.577084282879705"
## [1] "estimate mse =  0.586884694563133"
## [1] "estimate mse =  0.614764595154343"
## [1] "estimate mse =  0.668870637824216"
## [1] "estimate mse =  0.649682225303615"
## [1] "estimate mse =  0.747276988365045"
## [1] "estimate mse =  0.710807313618546"
colnames(baseline_df) = c("mse")
baseline_ds = data.frame(matrix(ncol = 2, nrow = 0))
baseline_ds = rbind(baseline_ds, c(mean(baseline_df$mse),sd(baseline_df$mse)))
colnames(baseline_ds) = c("mean","sd")
ggplot() +
  geom_point(data = baseline_df, aes("NA to column mean", mse)) +
  geom_point(data = k_value_validation_df[k_value_validation_df$k == 11,], aes("kNN imputation (k=11)",mse)) +
  geom_point(data = baseline_ds, aes("NA to column mean", mean), colour = 'red', size = 3) +
  geom_point(data = k_value_validation_ds[k_value_validation_ds$k == 11,], aes("kNN imputation (k=11)", mean), colour = 'red', size = 3) +
  geom_errorbar(
    data = baseline_ds,
    aes("NA to column mean", mean, ymin = mean - sd, ymax = mean + sd),
    colour = 'red',
    width = 0.2
  ) + 
  geom_errorbar(
    data = k_value_validation_ds[k_value_validation_ds$k == 11,],
    aes("kNN imputation (k=11)", mean, ymin = mean - sd, ymax = mean + sd),
    colour = 'red',
    width = 0.2
  ) +
  theme(axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold")) +
  labs(x = "")+
  ylim(0,0.8)

consider motifs with non-zero frequency

motifs_count_sum <- colSums(fullTable_Gasch[names_motifs_all],na.rm=TRUE)
remove_list = NULL

for (i in names(motifs_count_sum)){
  if (motifs_count_sum[[i]] == 0){
    remove_list <- c(remove_list,i)
  }
}

names_motifs_valid <- names_motifs_all[!names_motifs_all %in% remove_list]

choose lambda from 10-fold cross validation

We need to set the hyperparameter lambda to control the effect of penalization. A common approch to this is by cross validation.

cv_models_all = cv.glmnet(x, y, family = "mgaussian")
plot(cv_models_all)

lambda.min is the value of lambda that gives minimum cvm; lambda.1se is the largest value of lambda such that error is within 1 standard error of the minimum; The confidence intervals represent error estimates for the loss metric (red dots). They’re computed using cross validation. The vertical dashed lines show the locations of λmin and λ1se. The numbers across the top are the number of nonzero coefficient estimates. The error is accumulated, and the average error and standard deviation over the folds is computed.

training model with selected lambda

training model with λ = lambda.min results in a model with highest expected performancec while training model with λ = lambda.1se results in a more conservative model with fewer non-zero regression coefficients.

# to skip cross validation
lambda.min = 0.108274493279888
lambda.1se = 0.695996358333947

models_all_minLambda = glmnet(x, y, family = "mgaussian", lambda = lambda.min)
models_all_1seLambda = glmnet(x, y, family = "mgaussian", lambda = lambda.1se)

#models_all_minLambda = glmnet(x, y, family = "mgaussian", lambda = cv_models_all$lambda.min)
#models_all_1seLambda = glmnet(x, y, family = "mgaussian", lambda = cv_models_all$lambda.1se)

#paste("lambda.min =",cv_models_all$lambda.min)
#paste("lambda.1se =",cv_models_all$lambda.1se)

the penalize effect with λ = lambda.1se is much stronger than that with λ = lambda.min

We summary coefficients from all the models together to compare the penalized effect of models with different λ values. The model with λ = lambda.1se penalizes more than 90% of coefficients to 0 while the model with λ = lambda.min keep more than half of its coefficients non-zero.

regulatory effect of motifs decrease as time pass in heat shock

Here we plot a graph to show how the coefficient given to each motif change across the timeline after heat shock. For most of the motifs, their coefficients increase with shortly delay and reach their peak at 15mins. Than they gradually shift to 0 as time past. An explanation of the 15mins delay is that the cis-regulatory elements could regulate the expression level by adjusting the stability of mRNA which may need some time to decay.

plot_models_line(models_all_minLambda, names_temperature_condition, 10)

Part 4: analyze the regression coefficient

divide environmental condition into groups

With the models from group lasso, we can carefully select the regression coefficient from different groups of environmental conditions and compare them. By calculating the sum and L2 norm of their coefficient under each group of environmental condition, we are able to compare how the regulatory effect of each motif differ in each environmental stress.

{
  names_heatshock_29to33_condition = vector(mode="character", length=3)
  names_heatshock_29to33_condition[1] = "hs_29to33_05min"
  names_heatshock_29to33_condition[2] = "hs_29to33_15min"
  names_heatshock_29to33_condition[3] = "hs_29to33_30min"
}

{
  names_heatshock_sorbitol_condition = vector(mode="character", length=3)
  names_heatshock_sorbitol_condition[1] = "29C(1M_sorbitol)~33C(1M_sorbitol)_05min"
  names_heatshock_sorbitol_condition[2] = "29C(1M_sorbitol)~33C(1M_sorbitol)_15min"
  names_heatshock_sorbitol_condition[3] = "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
} 

{
  names_hypothermia_condition = vector(mode="character", length=3)
  names_hypothermia_condition[1] = "hs_37to25_15min"
  names_hypothermia_condition[2] = "hs_37to25_30min"
  names_hypothermia_condition[3] = "hs_37to25_60min"
} 

{
  names_heatshock_25to37_condition = vector(mode="character", length=3)
  names_heatshock_25to37_condition[1] = "hs_15min_hs-1"
  names_heatshock_25to37_condition[2] = "hs_30min_hs-1"
  names_heatshock_25to37_condition[3] = "hs_60min_hs-1"
} 
summary_motif_frequency = data.frame(matrix(ncol= 2, nrow = 0))

for(motif in names_motifs_valid){
  summary_motif_frequency = rbind(summary_motif_frequency, data.frame(
    motif = motif,
    transcriptAmount = log10(sum(fullTable_Gasch[[motif]] > 0))
  ))
}
summary_coefficient = merge(x = summary_coefficient, y = summary_motif_frequency, by = "motif")

L2 norm of the non-zero regression coefficient from significant motif under all condition

The following graph shows the L2 norm of the regression coefficient of some significant motif under all environmental conditions. The rank of each motif shows how significant a motif is in the environmental conditions we considered. However, since the models we fitted here only considered some typical environmental stress on yeast and is very unbalanced in the type of condition (about 40 model is temperature relevant), we couldn’t say the high-rank motif shown here is significant under general environmental stress.

compare L2 nore and sum of the non-zero regression coefficient from significant motif under heat shock and hypothermia

the regulatory effect of each motif under hypothermia is much smaller than that under heat shock. This could probably reflect the trigger of environmental stress response (ESR) mention by Gasch et al. (2000): “the ESR is not initiated in response to all environmental changes and that the large, transient changes in expression that are characteristic of the ESR are only seen when this response is initiated and not in the reciprocal response to diminished environmental stress.”

compare L2 norm and sum of the non-zero regression coefficient from significant motif under heat shock, heat shock with sorbitol

Although most of the regression coefficients are similar in both heat shock with and without sorbitol, there are some motifs show a relatively different regulatory effect on gene expression with sorbitol. Check the second graph Sum of the non-zero regression coefficient of significant motifs under heat shock with and without sorbitol after 5, 15 and 30 min, it is easy to see TGAGGCTA shows a much stronger regulatory effect on gene expression when there is sorbitol while UKWCGRGGN and GTAGTATTT acting in an opposite way, a much weaker regulatory effect under heat shock with sorbitol.

investigate the transcript frequency of each motif against their coefficient

order_motifs = summary_coefficient[summary_coefficient$condition == "all",]
order_motifs = order_motifs[order(-order_motifs$transcriptAmount),]$motif

box graph for cross comparison between two type of environmental stress

To compare how the coefficient of motif from two different type of environmental stress, we plot a box graph show the difference between the coefficients from two group of models.

{
  names_heatshock_29to33_condition = vector(mode="character", length=3)
  names_heatshock_29to33_condition[1] = "hs_29to33_05min"
  names_heatshock_29to33_condition[2] = "hs_29to33_15min"
  names_heatshock_29to33_condition[3] = "hs_29to33_30min"
}

{
  names_heatshock_sorbitol_condition = vector(mode="character", length=3)
  names_heatshock_sorbitol_condition[1] = "29C(1M_sorbitol)~33C(1M_sorbitol)_05min"
  names_heatshock_sorbitol_condition[2] = "29C(1M_sorbitol)~33C(1M_sorbitol)_15min"
  names_heatshock_sorbitol_condition[3] = "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
} 

{
  names_heatshock_25to37_condition = vector(mode="character", length=8)
  names_heatshock_25to37_condition[1] = "hs_05min_hs-1"
  names_heatshock_25to37_condition[2] = "hs_10min_hs-1"
  names_heatshock_25to37_condition[3] = "hs_15min_hs-1"
  names_heatshock_25to37_condition[4] = "hs_20min_hs-1"
  names_heatshock_25to37_condition[5] = "hs_30min_hs-1"
  names_heatshock_25to37_condition[6] = "hs_40min_hs-1"
  names_heatshock_25to37_condition[7] = "hs_60min_hs-1"
  names_heatshock_25to37_condition[8] = "hs_80min_hs-1"
}

{
  names_heatshock_37to25_condition = vector(mode="character", length=5)
  names_heatshock_37to25_condition[1] = "hs_37to25_15min"
  names_heatshock_37to25_condition[2] = "hs_37to25_30min"
  names_heatshock_37to25_condition[3] = "hs_37to25_45min"
  names_heatshock_37to25_condition[4] = "hs_37to25_60min"
  names_heatshock_37to25_condition[5] = "hs_37to25_90min"
}

comparison motifs rank between each group of models

As another approach from comparing the motifs’ regulatory effect between different type of environmental stresses, we compare the absolute value of their regression coefficients. Motifs are ranked in each model where rank 1 is given to the motifs with highest absolute value of coefficient. Then, for each group of environmental stress, we compute the average rank of each motif weighted by L2 of the coefficients in each model. Therefore, the rank from a model with motifs with low regression coefficients will get less weight rather than that of a model with motifs with high regression coefficients.

# prepare data
summary_coefficient_rank = data.frame(matrix(ncol = 4, nrow = 0))

for(environmental_stress in names_environmental_stress){
  coefficients = extract_coefficient(models_all_minLambda,environmental_stress)
  #coefficients = coefficients[order(abs(coefficients$value)),]
  coefficients = coefficients[order(-abs(coefficients$value)),]

  i = 1
  while (i <= dim(coefficients)[1]) {
    summary_coefficient_rank = rbind(summary_coefficient_rank, data.frame(
    motif = coefficients[i]$rn,
    rank = i,
    condition = environmental_stress,
    model.L2 = sqrt(sum(coefficients$value^2))
    ))
    i = i+1
  }
}
{
  names_heatshock_25to37_condition = vector(mode="character", length=8)
  names_heatshock_25to37_condition[1] = "hs_05min_hs-1"
  names_heatshock_25to37_condition[2] = "hs_10min_hs-1"
  names_heatshock_25to37_condition[3] = "hs_15min_hs-1"
  names_heatshock_25to37_condition[4] = "hs_20min_hs-1"
  names_heatshock_25to37_condition[5] = "hs_30min_hs-1"
  names_heatshock_25to37_condition[6] = "hs_40min_hs-1"
  names_heatshock_25to37_condition[7] = "hs_60min_hs-1"
  names_heatshock_25to37_condition[8] = "hs_80min_hs-1"
}

{
  names_heatshock_37to25_condition = vector(mode="character", length=5)
  names_heatshock_37to25_condition[1] = "hs_37to25_15min"
  names_heatshock_37to25_condition[2] = "hs_37to25_30min"
  names_heatshock_37to25_condition[3] = "hs_37to25_45min"
  names_heatshock_37to25_condition[4] = "hs_37to25_60min"
  names_heatshock_37to25_condition[5] = "hs_37to25_90min"
}

{
  names_sorbitol_condition = vector(mode="character", length=7)
  names_sorbitol_condition[1] = "1M_sorbitol_05min"
  names_sorbitol_condition[2] = "1M_sorbitol_15min"
  names_sorbitol_condition[3] = "1M_sorbitol_30min"
  names_sorbitol_condition[4] = "1M_sorbitol_45min"
  names_sorbitol_condition[5] = "1M_sorbitol_60min"
  names_sorbitol_condition[6] = "1M_sorbitol_90min"
  names_sorbitol_condition[7] = "1M_sorbitol_120min"
}

{
  names_Nitrogen_condition = vector(mode="character", length=7)
  names_Nitrogen_condition[1] = "Nitrogen_Depletion_30min"
  names_Nitrogen_condition[2] = "Nitrogen_Depletion_1h"
  names_Nitrogen_condition[3] = "Nitrogen_Depletion_2h"
  names_Nitrogen_condition[4] = "Nitrogen_Depletion_4h"
  names_Nitrogen_condition[5] = "Nitrogen_Depletion_8h"
  names_Nitrogen_condition[6] = "Nitrogen_Depletion_12h"
  names_Nitrogen_condition[7] = "Nitrogen_Depletion_1d"
}

{
  names_diamide_condition = vector(mode="character", length=8)
  names_diamide_condition[1] = "1.5mM_diamide_5min"
  names_diamide_condition[2] = "1.5mM_diamide_10min"
  names_diamide_condition[3] = "1.5mM_diamide_20min"
  names_diamide_condition[4] = "1.5mM_diamide_30min"
  names_diamide_condition[5] = "1.5mM_diamide_40min"
  names_diamide_condition[6] = "1.5mM_diamide_50min"
  names_diamide_condition[7] = "1.5mM_diamide_60min"
  names_diamide_condition[8] = "1.5mM_diamide_90min"
}




{
  names_dithiothrietol_condition = vector(mode="character", length=8)
  names_dithiothrietol_condition[1] = "2.5mM_DTT_005min_dtt-1"
  names_dithiothrietol_condition[2] = "2.5mM_DTT_015min_dtt-1"
  names_dithiothrietol_condition[3] = "2.5mM_DTT_030min_dtt-1"
  names_dithiothrietol_condition[4] = "2.5mM_DTT_045min_dtt-1"
  names_dithiothrietol_condition[5] = "2.5mM_DTT_060min_dtt-1"
  names_dithiothrietol_condition[6] = "2.5mM_DTT_090min_dtt-1"
  names_dithiothrietol_condition[7] = "2.5mM_DTT_120min_dtt-1"
  names_dithiothrietol_condition[8] = "2.5mM_DTT_180min_dtt-1"
}

{
  names_meanadione_condition = vector(mode="character", length=8)
  names_meanadione_condition[1] = "1mM_Menadione_10min_redo"
  names_meanadione_condition[2] = "1mM_Menadione_20min_redo"
  names_meanadione_condition[3] = "1mM_Menadione_30min_redo"
  names_meanadione_condition[4] = "1mM_Menadione_40min_redo"
  names_meanadione_condition[5] = "1mM_Menadione_50min_redo"
  names_meanadione_condition[6] = "1mM_Menadione_80min_redo"
  names_meanadione_condition[7] = "1mM_Menadione_120min_redo"
  names_meanadione_condition[8] = "1mM_Menadione_160min_redo"
}

{
  names_H2O2_condition = vector(mode="character", length=10)
  names_H2O2_condition[1] = "0.32mM_H2O2_10min_redo"
  names_H2O2_condition[2] = "0.32mM_H2O2_20min_redo"
  names_H2O2_condition[3] = "0.32mM_H2O2_30min_redo"
  names_H2O2_condition[4] = "0.32mM_H2O2_40min_rescan"
  names_H2O2_condition[5] = "0.32mM_H2O2_50min_redo"
  names_H2O2_condition[6] = "0.32mM_H2O2_60min_redo"
  names_H2O2_condition[7] = "0.32mM_H2O2_80min_redo"
  names_H2O2_condition[8] = "0.32mM_H2O2_100min_redo"
  names_H2O2_condition[9] = "0.32mM_H2O2_120min_redo"
  names_H2O2_condition[10] = "0.32mM_H2O2_160min_redo"
}
order the motif
plot graph

## Warning: Removed 5 rows containing missing values (geom_bar).

investigating bias and expression mean of each model

The bias of the model can be considered as the expected gene expression level change of a gene with no cis-regulatory element we considered in the model. In other words, the bias indicates the expression level change that cannot be explained by the model and could indicate the estimated proportion of the efforts on regulation these cis-regulatory elements make.

Select motif with weighted square L2

In order to focus only on the significant motifs, we calculate the variance predict by motif for each motif over all the conditions. In this case, we consider no only the regression coefficient but also the transcript frequence: $$ \ \ \[2ex] e_{mg}= \[\begin{cases} 1,\quad \text{at least one motif } m \text{ is on gene }g\\ 0, \quad otherwise \end{cases}\]

\[6ex]

\[\begin{align} \text{Variance pridict by motif} &= \sum_c\sum_g \beta _{mc}^2 \times e_{mg}\\ &= \beta^2_{mc} \times transcript\ frequency \\ &= L_2^2\times transcript\ frequency \end{align}\]

$$

group conditions
{
  names_heatshock_29to33_condition = vector(mode="character", length=3)
  names_heatshock_29to33_condition[1] = "hs_29to33_05min"
  names_heatshock_29to33_condition[2] = "hs_29to33_15min"
  names_heatshock_29to33_condition[3] = "hs_29to33_30min"
}

{
  names_heatshock_sorbitol_condition = vector(mode="character", length=3)
  names_heatshock_sorbitol_condition[1] = "29C(1M_sorbitol)~33C(1M_sorbitol)_05min"
  names_heatshock_sorbitol_condition[2] = "29C(1M_sorbitol)~33C(1M_sorbitol)_15min"
  names_heatshock_sorbitol_condition[3] = "29C(1M_sorbitol)~33C(1M_sorbitol)_30min"
}


{
  names_heatshock_25to37_condition = vector(mode="character", length=4)
  names_heatshock_25to37_condition[1] = "hs_05min_hs-1"
  names_heatshock_25to37_condition[2] = "hs_15min_hs-1"
  names_heatshock_25to37_condition[3] = "hs_30min_hs-1"
  names_heatshock_25to37_condition[4] = "hs_60min_hs-1"
}

{
  names_heatshock_37to25_condition = vector(mode="character", length=4)
  names_heatshock_37to25_condition[1] = "hs_37to25_15min"
  names_heatshock_37to25_condition[2] = "hs_37to25_30min"
  names_heatshock_37to25_condition[3] = "hs_37to25_45min"
  names_heatshock_37to25_condition[4] = "hs_37to25_60min"
}

{
  names_sorbitol_condition = vector(mode="character", length=5)
  names_sorbitol_condition[1] = "1M_sorbitol_05min"
  names_sorbitol_condition[2] = "1M_sorbitol_15min"
  names_sorbitol_condition[3] = "1M_sorbitol_30min"
  names_sorbitol_condition[4] = "1M_sorbitol_45min"
  names_sorbitol_condition[5] = "1M_sorbitol_60min"
}

{
  names_Nitrogen_condition = vector(mode="character", length=2)
  names_Nitrogen_condition[1] = "Nitrogen_Depletion_30min"
  names_Nitrogen_condition[2] = "Nitrogen_Depletion_1h"
}

{
  names_diamide_condition = vector(mode="character", length=7)
  names_diamide_condition[1] = "1.5mM_diamide_5min"
  names_diamide_condition[2] = "1.5mM_diamide_10min"
  names_diamide_condition[3] = "1.5mM_diamide_20min"
  names_diamide_condition[4] = "1.5mM_diamide_30min"
  names_diamide_condition[5] = "1.5mM_diamide_40min"
  names_diamide_condition[6] = "1.5mM_diamide_50min"
  names_diamide_condition[7] = "1.5mM_diamide_60min"
}


{
  names_dithiothrietol_condition = vector(mode="character", length=5)
  names_dithiothrietol_condition[1] = "2.5mM_DTT_005min_dtt-1"
  names_dithiothrietol_condition[2] = "2.5mM_DTT_015min_dtt-1"
  names_dithiothrietol_condition[3] = "2.5mM_DTT_030min_dtt-1"
  names_dithiothrietol_condition[4] = "2.5mM_DTT_045min_dtt-1"
  names_dithiothrietol_condition[5] = "2.5mM_DTT_060min_dtt-1"
}

{
  names_meanadione_condition = vector(mode="character", length=5)
  names_meanadione_condition[1] = "1mM_Menadione_10min_redo"
  names_meanadione_condition[2] = "1mM_Menadione_20min_redo"
  names_meanadione_condition[3] = "1mM_Menadione_30min_redo"
  names_meanadione_condition[4] = "1mM_Menadione_40min_redo"
  names_meanadione_condition[5] = "1mM_Menadione_50min_redo"
}

{
  names_H2O2_condition = vector(mode="character", length=6)
  names_H2O2_condition[1] = "0.32mM_H2O2_10min_redo"
  names_H2O2_condition[2] = "0.32mM_H2O2_20min_redo"
  names_H2O2_condition[3] = "0.32mM_H2O2_30min_redo"
  names_H2O2_condition[4] = "0.32mM_H2O2_40min_rescan"
  names_H2O2_condition[5] = "0.32mM_H2O2_50min_redo"
  names_H2O2_condition[6] = "0.32mM_H2O2_60min_redo"
}
manage plotting order
plotting the weighted square L2 of each motif

select motifs

We selected the 6 motifs with the highest weighted L2 value. They should be reliable significant motifs under the environmental conditions we considered.

names_motifs_significant = head(order_motifs,6)

Compare mean of L2 of significant motifs across different group of environmental stress

First, We align the models by only considering the model with time slices within 5min to 60mins. For each group of environmental stress, we calculate the mean of L2 to allow cross comparison across environmental stresses group with different number of models:

\[ mean\ L_2 = \sqrt{{\sum_c^N \beta _{mc}^2}\over \text{N}} \]

compare proportion of variance explain by each motif

However, the overall regulatory effect in a cell also depends on how strong the environmental stress is. For example, the change in expression level under heat shock from 25 degree to 37 degree should be much larger than that under heat shock from 29 degree to 33 degree. Therefore, we calculate the proportion of variance explain by motifs over the total variance to reduce such difference

We plot bar graphs to compare the value across each group of environmental stresses. The shape of the proportion variance graph is similar to that of the mean of L2. There is still a large difference between the value of heat shock from 29C to 33C and heat shock from 25C to 37C.

\[ \begin{align} \text{Variance pridict by motif} &= \sum_c({1\over G}\times \sum_g \beta _{mc}^2 \times e_{mg} - ({1\over G}\times\sum_g \beta _{mc} \times e_{mg})^2)\\ &= \sum_c (\beta^2_{mc}\ \times\ percentage\ of\ transcript\ frequence- (\beta_{mc} \times percentage\ of\ transcript\ frequence)^2) \\ \end{align} \]

\[ \text{proportion of variance explain by motif } = {\text{variance explain by motifs}\over \text{total variance}} \]

compute the total variance of expression level in each group